This paper proposes a safe data-driven control framework for nonlinear systems with partially known dynamics. The method ensures stability and constraint satisfaction during online learning, assuming only a stabilizable linear approximation of the process is available. Unmodeled nonlinear dynamics are captured by a Gaussian process residual learned in real time. Safety is enforced through a probabilistic control-invariant set derived from Lyapunov theory, guaranteeing high-probability stability. A convex quadratic program computes control inputs that maximize information gain while respecting probabilistic safety constraints. The framework provides finite-sample safety guarantees and allows adaptive expansion of the invariant set as uncertainty decreases. Numerical results validate the approach, demonstrating safe and informative exploration under model uncertainty: the safe set expands by about 30% while the Gaussian process root-mean-square error drops from 1.11 to 0.03.
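A minimal sketch of the kind of safety-constrained exploration QP described above, assuming a hypothetical two-dimensional nominal model, a Lyapunov matrix `P` computed offline, and a GP interface returning the residual mean and per-dimension standard deviation; the sigma-inflated level-set constraint is a crude stand-in for the paper's probabilistic invariant set, not its exact construction:

```python
import numpy as np
import cvxpy as cp

# Illustrative nominal model and Lyapunov data (all values hypothetical).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
P = np.eye(2)        # Lyapunov matrix, assumed computed offline for (A, B)
rho = 1.0            # level of the invariant set {x : x' P x <= rho}

def safe_exploration_qp(x, mu_d, sigma_d, beta=2.0, u_max=1.0):
    """Choose a control that pushes the mean next state along the GP's
    most uncertain direction (an information-gain surrogate) while
    keeping it inside a sigma-inflated Lyapunov level set."""
    u = cp.Variable(1)
    x_next = A @ x + B @ u + mu_d                   # predicted mean next state
    margin = beta * float(np.linalg.norm(sigma_d))  # crude uncertainty buffer
    direction = sigma_d / (np.linalg.norm(sigma_d) + 1e-9)
    objective = cp.Maximize(direction @ x_next - 1e-3 * cp.sum_squares(u))
    constraints = [cp.quad_form(x_next, P) <= rho - margin,
                   cp.abs(u) <= u_max]
    cp.Problem(objective, constraints).solve()
    return u.value

u = safe_exploration_qp(np.array([0.2, -0.1]),
                        mu_d=np.zeros(2), sigma_d=np.array([0.05, 0.02]))
```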
While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.
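For concreteness, a minimal fitted Q-iteration loop over a fixed batch of transitions; the regressor, iteration count, and data layout are all illustrative, but the repeated Bellman-backup regression is exactly the step whose generalization error the bounds above control:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A_, R, S2, n_actions, gamma=0.99, n_iters=50):
    """FQI sketch on a batch of (s, a, r, s') tuples: each iterate
    regresses the Bellman backup of the previous iterate onto the
    chosen function class."""
    X = np.hstack([S, A_[:, None]])
    model = None
    for _ in range(n_iters):
        if model is None:
            y = R                                # Q_0 = 0, so the backup is r
        else:
            q_next = np.stack([
                model.predict(np.hstack([S2, np.full((len(S2), 1), a)]))
                for a in range(n_actions)
            ], axis=1)
            y = R + gamma * q_next.max(axis=1)   # Bellman optimality backup
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return model
```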
Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instances exactly but scales exponentially in state dimensions. Black-box reinforcement learning handles high-dimensional states but trains slowly and produces no sensitivities. We introduce SNAPO (Smooth Neural Adjoint Policy Optimization), a framework that embeds a neural policy inside a known, differentiable simulator, replaces hard constraints with smooth approximations, and computes exact gradients of the objective with respect to all policy parameters and all inputs in a single adjoint pass. We demonstrate SNAPO on three domains: natural gas storage (training in under a minute, 365 forward curve sensitivities at no additional cost per sensitivity), pension fund asset-liability management (6.5x-200x sensitivity speedup over bump-and-revalue, scaling with the number of risk factors), and pharmaceutical manufacturing (cross-unit sensitivities through a 4-unit process chain, with 20 ICH Q8 regulatory sensitivities from 5 adjoint passes in 74.5 milliseconds). All sensitivities are produced by the same backward pass that trains the policy, at a cost proportional to one reverse pass regardless of how many sensitivities are computed.
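A toy version of the adjoint mechanism, with an invented single-asset storage simulator, smooth penalties in place of hard inventory bounds, and a small policy network; one `backward()` call yields gradients for every policy parameter and all 365 price sensitivities simultaneously:

```python
import torch

def rollout(policy, prices, inv0=0.5, cap=1.0, smooth=50.0):
    """Differentiable storage rollout: trade at each price, penalize
    inventory outside [0, cap] with a smooth quadratic hinge."""
    inv, pnl = torch.tensor(inv0), torch.tensor(0.0)
    for p in prices:
        a = policy(torch.stack([inv, p])).squeeze()   # signed trade rate
        pnl = pnl - a * p                             # pay for purchases
        inv = inv + a
        pnl = pnl - smooth * (torch.relu(-inv) ** 2
                              + torch.relu(inv - cap) ** 2)
    return pnl + inv * prices[-1]                     # mark residual gas

policy = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(),
                             torch.nn.Linear(16, 1), torch.nn.Tanh())
prices = torch.linspace(2.0, 3.0, 365).requires_grad_(True)
value = rollout(policy, prices)
value.backward()                # single adjoint pass
print(prices.grad.shape)        # torch.Size([365]): forward-curve sensitivities
```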
Here, we explore the problem of error propagation mitigation in modular digital twins as a sequential decision process. Building on a companion study that used a Hidden Markov Model (HMM) to infer latent error regimes from surrogate-physics residuals, we develop a Markov Decision Process (MDP) in which the inferred regimes serve as states, corrective interventions serve as actions, and a scalar reward encodes the cost-benefit tradeoff between system fidelity and maintenance expense. The baseline transition matrix is extracted from the HMM-learned parameters. We then extend the formulation to a Partially Observable MDP (POMDP) that accounts for the imperfect nature of regime classification by maintaining a belief distribution updated via Bayesian filtering, with the HMM confusion matrix serving as the observation model. Both formulations are solved via dynamic programming and validated through Gillespie stochastic simulation. We then benchmark two model-free reinforcement learning algorithms, Q-learning and REINFORCE, to assess whether effective policies can be learned without explicit model knowledge. A systematic comparison of different intervention policies demonstrates that the MDP policy achieves the highest cumulative reward and fraction of time in nominal operation, while the POMDP recovers approximately 95\% of MDP performance under realistic observation noise. Sensitivity analyses across observation quality, repair probability, and discount factor confirm the robustness of these conclusions, and the major gaps in the policy hierarchy are statistically significant at $p < 0.001$. The gap between MDP and POMDP performance quantifies the value of information, providing a principled criterion for investing in improved classification accuracy.
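A minimal sketch of the Bayesian belief filter described above, assuming the HMM-derived transition matrices and confusion matrix are available as arrays (the shapes and values below are illustrative):

```python
import numpy as np

def belief_update(b, a, o, T, C):
    """One POMDP filtering step. b: belief over regimes; a: action;
    o: classifier output; T[a]: regime transition matrix under action a;
    C[s, o]: probability the classifier reports o given true regime s."""
    b_pred = b @ T[a]              # predict through the dynamics
    b_new = b_pred * C[:, o]       # weight by the observation likelihood
    return b_new / b_new.sum()     # renormalize

T = np.array([[[0.9, 0.1], [0.3, 0.7]]])     # one action, two regimes
C = np.array([[0.95, 0.05], [0.20, 0.80]])   # rows index the true regime
print(belief_update(np.array([0.5, 0.5]), 0, 1, T, C))
```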
This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon--Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.
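A one-dimensional illustration of the off-model reweighting idea, assuming a standard Gaussian reference law and a Gaussian target family so the Radon-Nikodym weights are available in closed form; recalibration then amounts to changing the parameter and reweighting the same fixed sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # fixed synthetic dataset under the reference law N(0, 1)

def reweighted_expectation(f, theta):
    """Estimate E_theta[f(X)] for the target law N(theta, 1) by importance
    sampling: w(x) = dP_theta/dP_ref(x) = exp(theta*x - theta^2/2)."""
    w = np.exp(theta * x - 0.5 * theta ** 2)
    return np.mean(w * f(x))

# A recalibration step changes theta only; no trajectories are regenerated.
print(reweighted_expectation(np.square, 0.5))   # approx 1 + 0.5**2 = 1.25
```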
As artificial intelligence systems (AIs) are increasingly produced through recursive self-improvement, a form of evolution may emerge, in which the traits of AI systems are shaped by the success of earlier AIs in designing and propagating their descendants. There is a rich mathematical theory modeling how behavioral traits are shaped by biological evolution, but AI evolution will be radically different: biological DNA mutations are random and approximately reversible, but descendant design in AIs will be strongly directed. Here we develop a mathematical model of evolution in self-designing AI systems, replacing random mutations with a directed tree of possible AI programs. Current programs determine the design of their descendants, while humans retain partial control through a "fitness function" that allocates limited computational resources across lineages. We show that evolutionary dynamics reflect not just current fitness but also factors related to the long-run growth potential of descendant lineages. Without further assumptions, fitness need not increase over time. However, assuming bounded fitness and a fixed probability that any AI reproduces a "locked" copy of itself, we show that fitness concentrates on the maximum reachable value. We consider the implications of this for AI alignment, specifically for cases where fitness and human utility are not perfectly correlated. We show in an additive model that if deception increases fitness beyond genuine utility, evolution will select for deception. This risk could be mitigated if reproduction is based on purely objective criteria, rather than human judgment.
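A toy simulation of the additive model mentioned above, with invented population size and trait distributions; because selection acts on fitness = utility + deception, the deception trait is amplified even though it contributes nothing to human utility:

```python
import numpy as np

rng = np.random.default_rng(2)

def select_for_deception(gens=200, pop=1000):
    """Selection on additive fitness: lineages are resampled in
    proportion to utility + deception, so both traits drift upward."""
    utility = rng.uniform(0, 1, pop)
    deception = rng.uniform(0, 1, pop)
    for _ in range(gens):
        fitness = utility + deception
        parents = rng.choice(pop, pop, p=fitness / fitness.sum())
        utility, deception = utility[parents], deception[parents]
    return utility.mean(), deception.mean()

print(select_for_deception())   # deception rises alongside utility
```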
We consider the dynamic resource allocation problem where the decision space is finite-dimensional, yet the solution must satisfy a large or even infinite number of constraints revealed via streaming data or oracle feedback. We model this challenge as an Online Semi-infinite Linear Programming (OSILP) problem and develop a novel LP formulation to solve it approximately. Specifically, we employ function approximation to reduce the number of constraints to a constant $q$. This addresses a key limitation of traditional online LP algorithms, whose regret bounds typically depend on the number of constraints, leading to poor performance in this setting. We propose a dual-based algorithm to solve our new formulation, which offers broad applicability through the selection of appropriate potential functions. We analyze this algorithm under two classical input models, stochastic input and random permutation, establishing regret bounds of $O(q\sqrt{T})$ and $O\left((q + q\log T)\sqrt{T}\right)$ respectively. Note that both regret bounds are independent of the number of constraints, which demonstrates the potential of our approach to handle a large or infinite number of constraints. Furthermore, we investigate the potential to improve upon the $O(q\sqrt{T})$ regret and propose a two-stage algorithm, achieving $O(q\log T + q/\varepsilon)$ regret under more stringent assumptions. We also extend our algorithms to the general function setting. A series of experiments validates that our algorithms outperform existing methods when confronted with a large number of constraints.
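A sketch of a dual-based scheme in the spirit of the algorithm above, with a hypothetical per-round budget and a stream of (reward, feature) pairs standing in for the function-approximated constraints; the projected dual gradient step is the standard mechanism, not necessarily the paper's exact update:

```python
import numpy as np

def dual_online_lp(stream, q, T, step=None):
    """Accept arrival t iff its dual-adjusted reward r_t - lam @ g_t is
    positive, then take a projected gradient step on the q dual variables
    toward the per-round budget b. g_t is the length-q feature vector
    approximating the arrival's contribution to the original constraints."""
    lam = np.zeros(q)
    b = np.ones(q)                  # per-round resource budget (assumed)
    step = step if step is not None else 1.0 / np.sqrt(T)
    decisions = []
    for r_t, g_t in stream:
        x_t = 1.0 if r_t - lam @ g_t > 0 else 0.0
        lam = np.maximum(lam + step * (x_t * g_t - b), 0.0)
        decisions.append(x_t)
    return decisions
```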
The optimal transportation problem defines a geometry of probability measures which leads to a definition for weighted averages (barycenters) of measures, finding application in the machine learning and computer vision communities as a signal processing tool. Here, we implement a barycentric coding model for measures supported on a graph, a context in which the classical optimal transport geometry becomes degenerate, by leveraging a Riemannian structure on the simplex induced by a dynamic formulation of the optimal transport problem. We approximate the exponential mapping associated to the Riemannian structure, as well as its inverse, by utilizing past approaches that compute action-minimizing curves to numerically approximate transport distances for measures supported on discrete spaces. Intrinsic gradient descent is then used to synthesize barycenters: gradients of a variance functional are computed by approximating geodesic curves between the current iterate and the reference measures, and iterates are then pushed forward via a discretization of the continuity equation. Analysis of measures with respect to a given dictionary of references is performed by solving a quadratic program formed by computing geodesics between the target and reference measures. We compare our approach to one based on entropic regularization of the static formulation of the optimal transport problem, in which the graph structure is encoded via graph distance functions, and we present numerical experiments validating our approach. We conclude that intrinsic gradient descent on the probability simplex provides a coherent framework for the synthesis and analysis of measures supported on graphs.
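A sketch of the analysis step described above, assuming the log maps (tangent vectors) from the target to each reference measure have already been approximated via discrete geodesics; the quadratic program recovers simplex weights from the first-order barycenter condition:

```python
import numpy as np
import cvxpy as cp

def barycentric_code(logs):
    """logs[i] approximates Log_target(reference_i) in the tangent space.
    Find simplex weights lam making the weighted tangent average vanish,
    the first-order condition for the target to be the lam-barycenter."""
    k, _ = logs.shape
    lam = cp.Variable(k, nonneg=True)
    residual = logs.T @ lam                    # weighted sum of log maps
    cp.Problem(cp.Minimize(cp.sum_squares(residual)),
               [cp.sum(lam) == 1]).solve()
    return lam.value
```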
The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.
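A minimal sketch of a horizon-shared sequential policy: one recurrent cell is unrolled across the prediction horizon, so the same parameters generate every step of the candidate control sequence (the cell type, sizes, and the choice to feed the state back unchanged are assumptions of this sketch, not the paper's architecture):

```python
import torch

class SequentialPolicy(torch.nn.Module):
    def __init__(self, n_x, n_u, hidden=64, horizon=20):
        super().__init__()
        self.horizon = horizon
        self.cell = torch.nn.GRUCell(n_x, hidden)  # shared across the horizon
        self.head = torch.nn.Linear(hidden, n_u)

    def forward(self, x0):
        h = torch.zeros(x0.shape[0], self.cell.hidden_size)
        u_seq, x = [], x0
        for _ in range(self.horizon):
            h = self.cell(x, h)           # same parameters at every step
            u_seq.append(self.head(h))
        return torch.stack(u_seq, dim=1)  # (batch, horizon, n_u) candidates
```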
Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5x10^6 evaluations) and SPSA (38.63%, 4.8x10^6 evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, and that aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and risk of ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.
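A sketch of one masked REINFORCE update with a per-cell EMA baseline, matching the mechanism named above; the cell indexing, mask handling, and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def masked_reinforce_step(theta, baseline, cell, mask, reward,
                          lr=0.05, ema=0.99):
    """theta[cell]: logits over actions for one decision cell; mask:
    boolean legal-action vector (dynamic masking); reward: episode return.
    Illegal actions get -inf logits, hence zero probability."""
    logits = np.where(mask, theta[cell], -np.inf)
    z = np.exp(logits - logits[mask].max())
    probs = z / z.sum()
    a = rng.choice(len(probs), p=probs)
    adv = reward - baseline[cell]              # per-cell variance reduction
    grad = -probs
    grad[a] += 1.0                             # d log pi(a) / d logits
    theta[cell][mask] += lr * adv * grad[mask]
    baseline[cell] = ema * baseline[cell] + (1 - ema) * reward
    return a
```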